The purpose of this project is to gauge your technical skills and problem solving ability by working through something similar to a real NBA data science project. You will work your way through this R Markdown document, answering questions as you go along. Please begin by adding your name to the “author” key in the YAML header. When you’re finished with the document, come back and type your answers into the answer key at the top. Please leave all your work below and have your answers where indicated below as well. Please note that we will be reviewing your code so make it clear, concise and avoid long printouts. Feel free to add in as many new code chunks as you’d like.
Remember that we will be grading the quality of your code and visuals alongside the correctness of your answers. Please try to use the tidyverse as much as possible (instead of base R and explicit loops.)
Note:
Throughout this document, any season column
represents the year each season started. For example, the 2015-16 season
will be in the dataset as 2015. For most of the rest of the project, we
will refer to a season by just this number (e.g. 2015) instead of the
full text (e.g. 2015-16).
Question 1:
Question 2: XX.X Years
Question 3:
Open Ended Modeling Question: Please show your work and leave all responses below in the document.
Question 1: XX.X%
Question 2: Written question, put answer below in the
document.
Question 3: Written question, put answer below in the
document.
library(tidyverse)
# Note, you will likely have to change these paths. If your data is in the same folder as this project,
# the paths will likely be fixed for you by deleting ../../Data/awards_project/ from each string.
awards <- read_csv("/Users/zhenyan/Downloads/awards_data.csv")
player_data <- read_csv("/Users/zhenyan/Downloads/player_stats.csv")
team_data <- read_csv("/Users/zhenyan/Downloads/team_stats.csv")
rebounding_data <- read_csv("/Users/zhenyan/Downloads/team_rebounding_data_22.csv")
In this section, you’re going to work with data relating to player awards and statistics. You’ll start with some data manipulation questions and work towards building a model to predict broad levels of career success.
QUESTION: What is the average number of points per game for players in the 2007-2021 seasons who won All NBA First, Second, and Third Teams (not the All Defensive Teams), as well as for players who were in the All-Star Game (not the rookie all-star game)?
# Here and for all future questions, feel free to add as many code chunks as you like. Do NOT put echo = F though, we'll want to see your code.
library(dplyr)
# Merge player_data and awards
merged_data <- player_data %>%
left_join(awards, by = c("season", "nbapersonid"),relationship="many-to-many")
# Filter based on criteria and compute average points per game for each
first_team_avg_data <- merged_data %>%
filter(season >= 2007 & season <= 2021, `All NBA First Team` == 1) %>%
summarise(avg = mean(points / games, na.rm = TRUE))
first_team_avg <- first_team_avg_data$avg
second_team_avg_data <- merged_data %>%
filter(season >= 2007 & season <= 2021, `All NBA Second Team` == 1) %>%
summarise(avg = mean(points / games, na.rm = TRUE))
second_team_avg <- second_team_avg_data$avg
third_team_avg_data <- merged_data %>%
filter(season >= 2007 & season <= 2021, `All NBA Third Team` == 1) %>%
summarise(avg = mean(points / games, na.rm = TRUE))
third_team_avg <- third_team_avg_data$avg
allstar_team_avg_data <- merged_data %>%
filter(season >= 2007 & season <= 2021, all_star_game == TRUE) %>%
summarise(avg = mean(points / games, na.rm = TRUE))
allstar_team_avg <- allstar_team_avg_data$avg
round(first_team_avg,1)
## [1] 25.9
round(second_team_avg,1)
## [1] 23.1
round(third_team_avg,1)
## [1] 20.5
round(allstar_team_avg,1)
## [1] 21.6
ANSWER 1:
1st Team: 25.9 points per game
2nd Team: 23.1 points per game
3rd Team: 20.5 points per game
All-Star: 21.6 points per game
QUESTION: What was the average number of years of experience in the league it takes for players to make their first All NBA Selection (1st, 2nd, or 3rd team)? Please limit your sample to players drafted in 2007 or later who did eventually go on to win at least one All NBA selection. For example:
library(dplyr)
# Filter players drafted in 2007 or later
drafted_after_2006 <- player_data %>%
filter(draftyear >= 2007)
# Identify players with at least one All NBA selection
all_nba_players <- awards %>%
filter(
(`All NBA First Team` == 1 | `All NBA Second Team` == 1 | `All NBA Third Team` == 1) &
nbapersonid %in% drafted_after_2006$nbapersonid
)
# Calculate difference between draft year and the first All NBA award
years_to_first_award <- all_nba_players %>%
group_by(nbapersonid) %>%
summarise(first_award_year = min(season)) %>%#first award year
left_join(drafted_after_2006, by = "nbapersonid") %>%
mutate(years_to_first = first_award_year - draftyear)
# Compute average
avg_years_to_first_award <- mean(years_to_first_award$years_to_first, na.rm = TRUE)
print(round(avg_years_to_first_award,1))
## [1] 3.8
ANSWER 2:
3.8 Years
You’re going to work to create a dataset with a “career outcome” for each player, representing the highest level of success that the player achieved for at least two seasons after his first four seasons in the league (examples to follow below!). To do this, you’ll start with single season level outcomes. On a single season level, the outcomes are:
We need to make an adjustment for determining Starter/Rotation qualifications for a few seasons that didn’t have 82 games per team. Assume that there were 66 possible games in the 2011 lockout season and 72 possible games in each of the 2019 and 2020 seasons that were shortened due to covid. Specifically, if a player played 900 minutes in 2011, he would meet the rotation criteria because his final minutes would be considered to be 900 * (82/66) = 1118. Please use this math for both minutes and games started, so a player who started 38 games in 2019 or 2020 would be considered to have started 38 * (82/72) = 43 games, and thus would qualify for starting 41. Any answers should be calculated assuming you round the multiplied values to the nearest whole number.
Note that on a season level, a player’s outcome is the highest level of success he qualifies for in that season. Thus, since Shai Gilgeous-Alexander was both All-NBA 1st team and an All-Star last year, he would be considered to be “Elite” for the 2022 season, but would still qualify for a career outcome of All-Star if in the rest of his career he made one more All-Star game but no more All-NBA teams. Note this is a hypothetical, and Shai has not yet played enough to have a career outcome.
Examples:
QUESTION: There are 73 players in the
player_data dataset who have 2010 listed as their draft
year. How many of those players have a career outcome
in each of the 6 buckets?
#adjustments functions
adjust_games_started <- function(year, games) {
if (year == 2011) {
adjusted_games <- round(games * (82/66))
} else if (year %in% c(2019, 2020)) {
adjusted_games <- round(games * (82/72))
} else {
adjusted_games <- games
}
return(adjusted_games)
}
adjust_minutes_played <- function(year, minutes) {
if (year == 2011) {
adjusted_minutes <- round(minutes * (82/66))
} else if (year %in% c(2019, 2020)) {
adjusted_minutes <- round(minutes * (82/72))
} else {
adjusted_minutes <- minutes
}
return(adjusted_minutes)
}
merged_data <- merged_data %>%
rowwise() %>%
mutate(
adjusted_games_started = adjust_games_started(season, games_start),
adjusted_minutes_played = adjust_minutes_played(season, mins)
) %>%
ungroup()
#Players with 2010 as draft year
drafted_2010 <- merged_data %>% filter(draftyear == 2010)
#Assign single season outcomes
drafted_2010 <- drafted_2010 %>%
mutate(season_outcome = case_when(
`All NBA First Team`==1|`All NBA Second Team`==1|`All NBA Third Team`==1 ~ "Elite",
all_star_game == TRUE ~ "All-Star",
(adjusted_games_started >= 41 |adjusted_minutes_played >= 2000) ~ "Starter",
adjusted_minutes_played >= 1000 ~ "Rotation",
adjusted_minutes_played >= 1 ~ "Roster",
TRUE ~ "Out of the League"
))
#Assign career outcomes
career_outcomes <- drafted_2010 %>%
group_by(nbapersonid) %>%
slice(5:n()) %>%
count(season_outcome) %>%
top_n(n = 1, wt = n) %>%
arrange(-n) %>%
summarize(career_outcome = case_when(
any(season_outcome == "Elite") ~ "Elite",
any(season_outcome == "All-Star") ~ "All-Star",
any(season_outcome == "Starter") ~ "Starter",
any(season_outcome == "Rotation") ~ "Rotation",
any(season_outcome == "Roster") ~ "Roster",
TRUE ~ "Out of the League"
))
#Count number of players of each bucket
bucket_counts <- career_outcomes %>%
count(career_outcome)
bucket_counts
## # A tibble: 5 × 2
## career_outcome n
## <chr> <int>
## 1 All-Star 1
## 2 Elite 1
## 3 Roster 61
## 4 Rotation 2
## 5 Starter 8
ANSWER 3:
Elite: 1 players.
All-Star: 1 players.
Starter: 8 players.
Rotation: 2 players.
Roster: 61 players.
Out of League: 0 players.
In this question, you will work to build a model to predict a player’s career outcome based on information up through the first four years of his career.
This question is intentionally left fairly open ended, but here are some notes and specifications.
We know modeling questions can take a long time, and that qualified candidates will have different levels of experience with “formal” modeling. Don’t be discouraged. It’s not our intention to make you spend excessive time here. If you get your model to a good spot but think you could do better by spending a lot more time, you can just write a bit about your ideas for future improvement and leave it there. Further, we’re more interested in your thought process and critical thinking than we are in specific modeling techniques. Using smart features is more important than using fancy mathematical machinery, and a successful candidate could use a simple regression approach.
You may use any data provided in this project, but please do not bring in any external sources of data. Note that while most of the data provided goes back to 2007, All NBA and All Rookie team voting is only included back to 2011.
A player needs to complete at least three additional seasons after their first four to be considered as having a distinct career outcome for our dataset. (We are using 3+ instead of 2+ just to give each player a little more time to accumulate high level seasons before we classify his career). Because the dataset in this project ends in 2021, this means that a player would need to have had the chance to play in the ’21, ’20, and ’19 seasons after his first four years, and thus his first four years would have been ’18, ’17, ’16, and ’15. For this reason, limit your training data to players who were drafted in or before the 2015 season. Karl-Anthony Towns was the #1 pick in that season.
Once you build your model, predict on all players who were drafted in 2018-2021 (They have between 1 and 4 seasons of data available and have not yet started accumulating seasons that inform their career outcome).
You can predict a single career outcome for each player, but it’s better if you can predict the probability that each player falls into each outcome bucket.
Include, as part of your answer:
reactable) containing all predictions for the players
drafted in 2019-2021.library(dplyr)
library(tidyr)
library(ggplot2)
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(tidyverse)
# Preprocess merged_data
merged_data <- merged_data %>%
left_join(career_outcomes, by = "nbapersonid") %>%
replace_na(list(career_outcomes = "Out of the League"))
# For training set
data_train <- merged_data %>%
filter(draftyear <= 2015) %>%
group_by(nbapersonid) %>%
slice(1:4)
# Assign single season outcomes
data_train <- data_train %>%
mutate(season_outcome = case_when(
`All NBA First Team`==1|`All NBA Second Team`==1|`All NBA Third Team`==1 ~ "Elite",
all_star_game == TRUE ~ "All-Star",
(adjusted_games_started >= 41 |adjusted_minutes_played >= 2000) ~ "Starter",
adjusted_minutes_played >= 1000 ~ "Rotation",
adjusted_minutes_played >= 1 ~ "Roster",
TRUE ~ "Out of the League"
))
# Compute career_outcomes for each nbapersonid
career_outcomes <- data_train %>%
group_by(nbapersonid) %>%
slice(5:n()) %>%
count(season_outcome) %>%
top_n(n = 1, wt = n) %>%
summarize(career_outcome = case_when(
any(season_outcome == "Elite") ~ "Elite",
any(season_outcome == "All-Star") ~ "All-Star",
any(season_outcome == "Starter") ~ "Starter",
any(season_outcome == "Rotation") ~ "Rotation",
any(season_outcome == "Roster") ~ "Roster",
TRUE ~ "Out of the League"
)) %>%
ungroup()
# Handle NA values in player_data
player_data[is.na(player_data)] <- 0
numeric_columns_player_data <- sapply(player_data, is.numeric)
avg_player_data_numeric <- player_data %>%
filter(draftyear <= 2015) %>%
select(which(numeric_columns_player_data)) %>%
group_by(nbapersonid, season) %>%
summarise_all(mean, na.rm=TRUE) %>%
ungroup()
player_aggregated_data <- avg_player_data_numeric %>%
group_by(nbapersonid) %>%
summarise(across(where(is.numeric), mean, na.rm = TRUE)) %>%
ungroup()
## Warning: There was 1 warning in `summarise()`.
## ℹ In argument: `across(where(is.numeric), mean, na.rm = TRUE)`.
## ℹ In group 1: `nbapersonid = 15`.
## Caused by warning:
## ! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
## Supply arguments directly to `.fns` through an anonymous function instead.
##
## # Previously
## across(a:b, mean, na.rm = TRUE)
##
## # Now
## across(a:b, \(x) mean(x, na.rm = TRUE))
# Join with career_outcomes to get training_set
training_set <- career_outcomes %>%
left_join(player_aggregated_data, by = "nbapersonid")
head(training_set)
## # A tibble: 6 × 48
## nbapersonid career_outcome season draftyear draftpick nbateamid games
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 15 Roster 2007 1994 15 1610612756 16
## 2 87 Roster 2008. 1991 4 1610612745 24
## 3 109 Roster 2007 1992 11 1610612759 45
## 4 136 Roster 2007 1992 29 1610612738 18
## 5 185 Roster 2007 1993 1 1610612744 9
## 6 208 Roster 2007 1993 24 1610612742 27.5
## # ℹ 41 more variables: games_start <dbl>, mins <dbl>, fgm <dbl>, fga <dbl>,
## # fgp <dbl>, fgm3 <dbl>, fga3 <dbl>, fgp3 <dbl>, fgm2 <dbl>, fga2 <dbl>,
## # fgp2 <dbl>, efg <dbl>, ftm <dbl>, fta <dbl>, ftp <dbl>, off_reb <dbl>,
## # def_reb <dbl>, tot_reb <dbl>, ast <dbl>, steals <dbl>, blocks <dbl>,
## # tov <dbl>, tot_fouls <dbl>, points <dbl>, PER <dbl>, FTr <dbl>,
## # off_reb_pct <dbl>, def_reb_pct <dbl>, tot_reb_pct <dbl>, ast_pct <dbl>,
## # stl_pct <dbl>, blk_pct <dbl>, tov_pct <dbl>, usg <dbl>, OWS <dbl>, …
# Handle NA values for numeric columns in merged_data
numeric_cols <- sapply(merged_data, is.numeric)
merged_data[numeric_cols] <- lapply(merged_data[numeric_cols], function(x) ifelse(is.na(x), 0, x))
# Handle NA values for character or factor columns in merged_data
char_or_factor_cols <- sapply(merged_data, function(x) is.character(x) | is.factor(x))
merged_data[char_or_factor_cols] <- lapply(merged_data[char_or_factor_cols], function(x) ifelse(is.na(x), "Unknown", x))
# Preprocess test set
merged_data <- merged_data %>%
left_join(career_outcomes, by = "nbapersonid") %>%
replace_na(list(career_outcomes = "Out of the League"))
data_train <- merged_data %>%
filter(draftyear >= 2018 & draftyear <= 2021) %>%
group_by(nbapersonid) %>%
slice(1:4)
# Assign single season outcomes
data_train <- data_train %>%
mutate(season_outcome = case_when(
`All NBA First Team`==1|`All NBA Second Team`==1|`All NBA Third Team`==1 ~ "Elite",
all_star_game == TRUE ~ "All-Star",
(adjusted_games_started >= 41 |adjusted_minutes_played >= 2000) ~ "Starter",
adjusted_minutes_played >= 1000 ~ "Rotation",
adjusted_minutes_played >= 1 ~ "Roster",
TRUE ~ "Out of the League"
))
# Compute career_outcomes for each nbapersonid
career_outcomes <- data_train %>%
group_by(nbapersonid) %>%
slice(5:n()) %>%
count(season_outcome) %>%
top_n(n = 1, wt = n) %>%
summarize(career_outcome = case_when(
any(season_outcome == "Elite") ~ "Elite",
any(season_outcome == "All-Star") ~ "All-Star",
any(season_outcome == "Starter") ~ "Starter",
any(season_outcome == "Rotation") ~ "Rotation",
any(season_outcome == "Roster") ~ "Roster",
TRUE ~ "Out of the League"
)) %>%
ungroup()
# Handle NA values in player_data
player_data[is.na(player_data)] <- 0
numeric_columns_player_data <- sapply(player_data, is.numeric)
avg_player_data_numeric <- player_data %>%
filter(draftyear >= 2018 & draftyear <= 2021) %>%
select(which(numeric_columns_player_data)) %>%
group_by(nbapersonid, season) %>%
summarise_all(mean, na.rm=TRUE) %>%
ungroup()
player_aggregated_data <- avg_player_data_numeric %>%
group_by(nbapersonid) %>%
summarise(across(where(is.numeric), mean, na.rm = TRUE)) %>%
ungroup()
# Join with career_outcomes to get test_set
test_set <- career_outcomes %>%
left_join(player_aggregated_data, by = "nbapersonid")
player_names <- player_data %>%
select(nbapersonid, player)
test_set <- test_set %>%
left_join(player_names, by = "nbapersonid")
head(test_set)
## # A tibble: 6 × 49
## nbapersonid career_outcome season draftyear draftpick nbateamid games
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1628238 Roster 2021 2018 0 1610612756 2
## 2 1628959 Roster 2018 2018 0 1610612741 10
## 3 1628960 Starter 2020. 2018 21 1610612759. 48
## 4 1628960 Starter 2020. 2018 21 1610612759. 48
## 5 1628960 Starter 2020. 2018 21 1610612759. 48
## 6 1628960 Starter 2020. 2018 21 1610612759. 48
## # ℹ 42 more variables: games_start <dbl>, mins <dbl>, fgm <dbl>, fga <dbl>,
## # fgp <dbl>, fgm3 <dbl>, fga3 <dbl>, fgp3 <dbl>, fgm2 <dbl>, fga2 <dbl>,
## # fgp2 <dbl>, efg <dbl>, ftm <dbl>, fta <dbl>, ftp <dbl>, off_reb <dbl>,
## # def_reb <dbl>, tot_reb <dbl>, ast <dbl>, steals <dbl>, blocks <dbl>,
## # tov <dbl>, tot_fouls <dbl>, points <dbl>, PER <dbl>, FTr <dbl>,
## # off_reb_pct <dbl>, def_reb_pct <dbl>, tot_reb_pct <dbl>, ast_pct <dbl>,
## # stl_pct <dbl>, blk_pct <dbl>, tov_pct <dbl>, usg <dbl>, OWS <dbl>, …
# For the training set
training_set <- training_set %>%
mutate(
# 1. Shooting Efficiency
shooting_efficiency = ifelse(fga != 0, fgm / fga, 0),
# 2. Three-point Shooting Efficiency
three_point_efficiency = ifelse(fga3 != 0, fgm3 / fga3, 0),
# 3. Free Throw Efficiency
free_throw_efficiency = ifelse(fta != 0, ftm / fta, 0),
# 4. Points per Minute
points_per_minute = ifelse(mins != 0, points / mins, 0),
# 5. Points per Game
points_per_game = ifelse(games != 0, points / games, 0),
# 6. Minutes per Game
mins_per_game = ifelse(games != 0, mins / games, 0),
# 7. Year-on-Year Improvement
ppg = points_per_game,
yoy_improvement = ppg - lag(ppg, 1)
) %>%
ungroup()
# For the test set
test_set <- test_set %>%
mutate(
# 1. Shooting Efficiency
shooting_efficiency = ifelse(fga != 0, fgm / fga, 0),
# 2. Three-point Shooting Efficiency
three_point_efficiency = ifelse(fga3 != 0, fgm3 / fga3, 0),
# 3. Free Throw Efficiency
free_throw_efficiency = ifelse(fta != 0, ftm / fta, 0),
# 4. Points per Minute
points_per_minute = ifelse(mins != 0, points / mins, 0),
# 5. Points per Game
points_per_game = ifelse(games != 0, points / games, 0),
# 6. Minutes per Game
mins_per_game = ifelse(games != 0, mins / games, 0),
# 7. Year-on-Year Improvement
ppg = points_per_game,
yoy_improvement = ppg - lag(ppg, 1)
) %>%
ungroup()
#lasso regression to determine which features are chosen
library(glmnet)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loaded glmnet 4.1-7
# Encode career_outcome using label encoding
training_set$career_outcome <- as.numeric(factor(training_set$career_outcome))
# Split data into features and target
X <- training_set %>% select(-nbapersonid, -career_outcome,-draftyear,-season,-draftpick,-nbateamid)
y <- training_set$career_outcome
# Impute missing values with median for numeric columns
X_imputed <- X
num_cols <- sapply(X, is.numeric)
X_imputed[, num_cols] <- lapply(X_imputed[, num_cols], function(x) ifelse(is.na(x), median(x, na.rm = TRUE), x))
# Standardize the imputed features
X_scaled <- scale(X_imputed)
# Fit Lasso model
lasso_model <- glmnet(X_scaled, y, alpha = 1, lambda = cv.glmnet(X_scaled, y, alpha = 1)$lambda.min)
# Extract important features
important_features <- rownames(coef(lasso_model))[which(coef(lasso_model) != 0)]
important_features
## [1] "(Intercept)" "games" "games_start"
## [4] "mins" "fgp" "ftm"
## [7] "ftp" "ast" "steals"
## [10] "tov" "tot_fouls" "FTr"
## [13] "ast_pct" "OWS" "DWS"
## [16] "VORP" "points_per_minute"
set.seed(1)
sample <- sample(c(TRUE, FALSE), nrow(X_scaled), replace=TRUE, prob=c(0.7,0.3))
training_data <- training_set[sample,]
cross_validation <- training_set[!sample,]
#convert to data frame
training_data <- as.data.frame(training_data)
cross_validation <- as.data.frame(cross_validation)
library(nnet)
# 1. Prepare the data with the selected features
selected_features <- c("games", "games_start", "mins", "fgp2", "efg", "ftm", "ast", "steals",
"tov", "tot_fouls", "PER", "FTr", "ast_pct", "stl_pct", "blk_pct",
"tov_pct", "OWS", "DWS", "VORP", "three_point_efficiency",
"free_throw_efficiency", "points_per_minute", "points_per_game")
X_selected <- training_data[, selected_features]
y <- as.factor(training_data$career_outcome) # ensure the response variable is a factor
# 2. Fit the multiple logistic regression model
multinom_model <- multinom(y ~ ., data = data.frame(y, X_selected))
## # weights: 125 (96 variable)
## initial value 1260.189885
## iter 10 value 773.399946
## iter 20 value 675.757956
## iter 30 value 606.207788
## iter 40 value 554.578863
## iter 50 value 421.473587
## iter 60 value 384.268803
## iter 70 value 380.118241
## iter 80 value 379.096946
## iter 90 value 378.368910
## iter 100 value 377.529767
## final value 377.529767
## stopped after 100 iterations
# 3. Summarize the model to interpret the results
#summary(multinom_model)
# 4. Performance on cross validation set
actual_outcomes <- as.numeric(cross_validation$career_outcome)
prediction <- predict(multinom_model,newdata=cross_validation)
accuracy <- sum(prediction == actual_outcomes, na.rm = TRUE)/length(actual_outcomes)
cat("The accurarcy is:",accuracy)#0.7402985
## The accurarcy is: 0.7402985
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
#fit the model
data_clean <- na.omit(data.frame(y, X_selected))
rf_model <- randomForest(y ~ ., data = data_clean, ntree = 2000)
#summary
#summary(rf_model)
# 4. Performance on cross validation set
actual_outcomes <- as.numeric(cross_validation$career_outcome)
prediction <- predict(rf_model,newdata=cross_validation)
accuracy <- sum(prediction == actual_outcomes, na.rm = TRUE)/length(actual_outcomes)
cat("The accurarcy is:",accuracy)#0.7492537,better
## The accurarcy is: 0.7492537
multinom_prediction <- predict(multinom_model,test_set)
rf_prediction <- predict(rf_model,test_set)
#multinom_prediction
#rf_prediction
#with probabilities
multinom_prediction_prob <- predict(multinom_model,test_set,type='prob')
rf_prediction_prob <- predict(rf_model,test_set,type='prob')
#multinom_prediction_prob
#rf_prediction_prob
library(tibble)
labels <- c("Elite", "All-Star", "Starter", "Rotation", "Roster")
# Identify and remove duplicate players, retaining only unique player indices
unique_player_indices <- !duplicated(test_set$player)
player_indices <- which(unique_player_indices)
# Filter predictions and probabilities for the specific players
multinom_preds_for_players <- multinom_prediction[player_indices]
rf_preds_for_players <- rf_prediction[player_indices]
multinom_probs_for_players <- multinom_prediction_prob[player_indices, ]
rf_probs_for_players <- rf_prediction_prob[player_indices, ]
# Convert probabilities into a readable format
multinom_probs_readable <- apply(multinom_probs_for_players, 1, function(row) {
paste(labels, round(row, 3), sep=": ", collapse=" | ")
})
rf_probs_readable <- apply(rf_probs_for_players, 1, function(row) {
paste(labels, round(row, 3), sep=": ", collapse=" | ")
})
# Construct the data frame
predictions_df <- tibble(
Player = test_set$player[player_indices],
Multinom_Predictions = labels[as.numeric(multinom_preds_for_players)],
RF_Predictions = labels[as.numeric(rf_preds_for_players)],
Multinom_Probabilities = multinom_probs_readable,
RF_Probabilities = rf_probs_readable
)
print(predictions_df)
## # A tibble: 396 × 5
## Player Multinom_Predictions RF_Predictions Multinom_Probabilities
## <chr> <chr> <chr> <chr>
## 1 Paris Bass Starter Starter Elite: 0 | All-Star: …
## 2 Rawle Alkins Starter Starter Elite: 0 | All-Star: …
## 3 Grayson Allen Starter Starter Elite: 0 | All-Star: …
## 4 Kostas Antetokoun… Starter Starter Elite: 0 | All-Star: …
## 5 Udoka Azubuike Starter Starter Elite: 0 | All-Star: …
## 6 Marvin Bagley Starter Starter Elite: 0 | All-Star: …
## 7 Mohamed Bamba Starter Starter Elite: 0 | All-Star: …
## 8 Keita Bates-Diop Starter Starter Elite: 0 | All-Star: …
## 9 Brian Bowen II Starter Starter Elite: 0 | All-Star: …
## 10 Mikal Bridges Roster Roster Elite: 0 | All-Star: …
## # ℹ 386 more rows
## # ℹ 1 more variable: RF_Probabilities <chr>
Understanding the future trajectory of an NBA player is a multifaceted challenge. Our model is designed to demystify this process by harnessing data from players’ early careers to forecast their potential.
We have rigorously analyzed historical data, focusing on players who began their careers before 2015. Key performance metrics, including games played, starting appearances, minutes on the court, shooting accuracy, defensive contributions, and several others, have been meticulously incorporated. These parameters have been chosen because they consistently demonstrate a strong correlation with long-term player success.
Utilizing these metrics, our predictive model offers an informed assessment of the career trajectories for players who debuted between 2018 and 2021. In rigorous testing environments, our model demonstrated a commendable accuracy of 75%. While no predictive tool can claim infallibility, we believe our model serves as an invaluable asset, providing robust, data-driven insights to guide front office decisions regarding player potential.
In essence, our approach marries historical insights with advanced analytics to offer a strategic perspective on player potential, assisting your team in making enlightened decisions for the future of your franchise.
####2. Strenth and weakness of the model:
Strengths:
Comprehensive Metrics: Our model evaluates a wide array of on-court metrics, providing a holistic view of a player’s performance.
Historical Context: By examining players who started before 2015, we draw from a robust dataset that covers various playing styles and eras.
High Predictive Accuracy: With a commendable 75% accuracy rate in test environments, our model proves to be both reliable and efficient.
Weaknesses:
Data Limitations: Our dataset only encompasses data from before 2015. As a result, recent shifts in the NBA landscape—such as the emphasis on three-point shooting or evolving defensive schemes—might not be fully captured.
Overlooked Variables: Sports, especially a dynamic game like basketball, have various externalities. Our model may not account for factors such as player injuries, changing team dynamics, coaching styles, or off-court issues that can influence a player’s career trajectory.
Improvement Pathways:
To address these shortcomings, our next steps could involve:
Expanding Data Range: Incorporating data from recent years would help capture newer trends and patterns in the NBA, making our predictions more relevant. Inclusion of External Factors: We recognize the importance of non-statistical elements. Future iterations could include more qualitative data or secondary metrics to evaluate off-court influences, team dynamics, and other intangibles. Model advanced model: with more rows of data, we can developed deep learning model to do the classfication
library(ggplot2)
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
# Multinomial Logistic Regression
probs_multinom <- predict(multinom_model, newdata = cross_validation, type = "probs")
# Random Forest
probs_rf <- predict(rf_model, newdata = cross_validation, type = "prob")
roc_data <- data.frame()
classes <- levels(factor(cross_validation$career_outcome))
# Loop over each class
for(class in classes){
# Convert actual outcomes to a binary format: "PositiveClass" for the current class, "Other" otherwise
binary_outcome <- ifelse(cross_validation$career_outcome == class, "PositiveClass", "Other")
# Compute the ROC curve for the multinomial model with direction specified
roc_multinom <- roc(binary_outcome, probs_multinom[, class], levels=c("Other", "PositiveClass"), quiet = TRUE)
# Compute the ROC curve for the random forest model with direction specified
roc_rf <- roc(binary_outcome, probs_rf[, class], levels=c("Other", "PositiveClass"), quiet=TRUE)
# Append ROC data to the data frame
roc_data <- rbind(roc_data,
data.frame(Model="Multinom", Class=class, TPR=roc_multinom$sensitivities, FPR=roc_multinom$specificities),
data.frame(Model="Random Forest", Class=class, TPR=roc_rf$sensitivities, FPR=roc_rf$specificities)
)
}
# Checking the head of roc_data again
head(roc_data)
## Model Class TPR FPR
## 1 Multinom 1 1 0.000000000
## 2 Multinom 1 1 0.003058104
## 3 Multinom 1 1 0.006116208
## 4 Multinom 1 1 0.009174312
## 5 Multinom 1 1 0.012232416
## 6 Multinom 1 1 0.015290520
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(ggplot2)
# Define colors
colors <- c("Multinom" = "#002c53", "Random Forest" = "#ffa510")
plots <- list()
for(class in classes){
binary_outcome <- ifelse(cross_validation$career_outcome == class, "PositiveClass", "Other")
auc_multinom <- auc(roc(binary_outcome, probs_multinom[, class], levels=c("Other", "PositiveClass")))
auc_rf <- auc(roc(binary_outcome, probs_rf[, class], levels=c("Other", "PositiveClass")))
# Create a subset of roc_data for the specific class
subset_data <- roc_data[roc_data$Class == class,]
# Create the ggplot object for the specific class
p <- ggplot(subset_data, aes(x = FPR, y = TPR, color = Model)) +
geom_line(size = 1) +
geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "grey50") +
labs(title = paste("ROC Curve for Class", class, "with AUC", "\nMultinom AUC:", round(auc_multinom, 3), "Random Forest AUC:", round(auc_rf, 3)),
x = "False Positive Rate", y = "True Positive Rate") +
theme_minimal() +
theme(legend.position = "bottom") +
scale_color_manual(values = colors, name = "Model")
# Convert the ggplot object to a plotly object and append to the plots list
plots[[class]] <- ggplotly(p)
}
## Setting direction: controls < cases
## Setting direction: controls < cases
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
## Setting direction: controls < cases
#view
plots[[1]]
plots[[2]]
plots[[3]]
plots[[4]]
plots[[5]]
specific_players <- c("Shai Gilgeous-Alexander", "Zion Williamson", "James Wiseman", "Josh Giddey")
# Get unique indices for specific players
unique_player_indices <- which(!duplicated(test_set$player) & test_set$player %in% specific_players)
# Update the predictions and probabilities using the unique indices
multinom_preds_for_players <- multinom_prediction[unique_player_indices]
rf_preds_for_players <- rf_prediction[unique_player_indices]
multinom_probs_for_players <- multinom_prediction_prob[unique_player_indices, ]
rf_probs_for_players <- rf_prediction_prob[unique_player_indices, ]
# Convert probabilities into a readable format
multinom_probs_readable <- apply(multinom_probs_for_players, 1, function(row) {
paste(labels, round(row, 3), sep=": ", collapse=" | ")
})
rf_probs_readable <- apply(rf_probs_for_players, 1, function(row) {
paste(labels, round(row, 3), sep=": ", collapse=" | ")
})
# Construct the tibble
specific_predictions_df <- tibble(
Player = specific_players,
Multinom_Predictions = labels[as.numeric(multinom_preds_for_players)],
RF_Predictions = labels[as.numeric(rf_preds_for_players)],
Multinom_Probabilities = multinom_probs_readable,
RF_Probabilities = rf_probs_readable
)
print(specific_predictions_df)
## # A tibble: 4 × 5
## Player Multinom_Predictions RF_Predictions Multinom_Probabilities
## <chr> <chr> <chr> <chr>
## 1 Shai Gilgeous-Alex… Roster Roster Elite: 0.065 | All-St…
## 2 Zion Williamson All-Star Elite Elite: 0.019 | All-St…
## 3 James Wiseman Starter Starter Elite: 0 | All-Star: …
## 4 Josh Giddey Starter Roster Elite: 0.025 | All-St…
## # ℹ 1 more variable: RF_Probabilities <chr>
library(reactable)
my_labels <- c("Elite", "All-Star", "Starter", "Rotation", "Roster")
# Identify the rows for players drafted in 2019-2021:
draft_years <- c(2019, 2020, 2021)
filtered_row_numbers <- which(test_set$draftyear %in% draft_years)
# Extract predictions
drafted_multinom_preds <- multinom_prediction[filtered_row_numbers]
drafted_rf_preds <- rf_prediction[filtered_row_numbers]
drafted_multinom_probs <- multinom_prediction_prob[filtered_row_numbers, ]
drafted_rf_probs <- rf_prediction_prob[filtered_row_numbers, ]
# Convert probabilities to readable format
multinom_probs_readable <- apply(drafted_multinom_probs, 1, function(row) {
paste(my_labels, round(row, 3), sep=": ", collapse=" | ")
})
rf_probs_readable <- apply(drafted_rf_probs, 1, function(row) {
paste(my_labels, round(row, 3), sep=": ", collapse=" | ")
})
# Construct the tibble
predictions_df <- tibble(
Player = test_set$player[filtered_row_numbers],
Multinom_Predictions = labels[as.numeric(drafted_multinom_preds)],
RF_Predictions = labels[as.numeric(drafted_rf_preds)],
Multinom_Probabilities = multinom_probs_readable,
RF_Probabilities = rf_probs_readable
)
unique_predictions_df <- distinct(predictions_df, Player, .keep_all = TRUE)
reactable(unique_predictions_df)
In this section, we’re going to introduce a simple way to predict team offensive rebound percent in the next game and then discuss ways to improve those predictions.
Using the rebounding_data dataset, we’ll predict a
team’s next game’s offensive rebounding percent to be their average
offensive rebounding percent in all prior games. On a single game level,
offensive rebounding percent is the number of offensive rebounds divided
by their number offensive rebound “chances” (essentially the team’s
missed shots). On a multi-game sample, it should be the total number of
offensive rebounds divided by the total number of offensive rebound
chances.
Please calculate what OKC’s predicted offensive rebound percent is for game 81 in the data. That is, use games 1-80 to predict game 81.
# Filter out OKC's games 1-80
okc_data <- rebounding_data %>%
filter(team == "OKC" & game_number <= 80)
# Calculate total offensive rebounds and total rebound chances
total_offensive_rebounds <- sum(okc_data$offensive_rebounds)
total_rebound_chances <- sum(okc_data$off_rebound_chances)
# Predict the offensive rebound percentage for game 81
predicted_oreb_pct_for_game_81 <- total_offensive_rebounds / total_rebound_chances
print(predicted_oreb_pct_for_game_81)
## [1] 0.2886898
# Display the result
cat("Based on OKC's performance in games 1-80, the predicted offensive rebounding percentage for game 81 is approximately",
round(predicted_oreb_pct_for_game_81 * 100, 1), "%", "\n")
## Based on OKC's performance in games 1-80, the predicted offensive rebounding percentage for game 81 is approximately 28.9 %
ANSWER 1:
28.9%
There are a few limitations to the method we used above. For example, if a team has a great offensive rebounder who has played in most games this season but will be out due to an injury for the next game, we might reasonably predict a lower team offensive rebound percent for the next game.
Please discuss how you would think about changing our original model to better account for missing players. You do not have to write any code or implement any changes, and you can assume you have access to any reasonable data that isn’t provided in this project. Try to be clear and concise with your answer.
ANSWER 2:
When accounting for the absence of key players, particularly those who significantly impact a team’s performance, our model would need more sophistication. Here’s how we might approach the task:
1.Individual Player Metrics: We should first analyze the rebounding statistics on a player-by-player basis, not just at the team level. This will allow us to understand the contribution of each player to the overall team’s offensive rebound percentage.
2.Player Impact Factor: For each player, calculate an “Impact Factor” based on their contribution to the team’s offensive rebounding. This factor might be a combination of:
Their average offensive rebounds per game. The percentage of team’s total offensive rebounds they are responsible for. Their overall presence on the court, like minutes played. Player Availability Data: We would need a dataset that tells us about player availability for each game. This dataset would include details about injuries, suspensions, or any other reason a player might miss a game.
3.Dynamic Prediction: Based on the availability of players for game 81:
Subtract the ‘Impact Factor’ of the missing key players from the team’s overall offensive rebound percentage to predict a revised percentage for that game. If backups or replacements are known to be playing, their ‘Impact Factor’ could be added, though it’s essential to note that a backup player might not perform at the same level in a starting role. Historical Analysis: If the key player has missed games before, analyze the team’s offensive rebounding performance during those games to understand the real-world impact. This historical evidence can be used to adjust our predictions further.
4.Consider Team Strategy and Tactics: Teams might adjust their strategies based on the players available. If we can analyze how strategies have changed in past games when key players were absent, we can incorporate these tactical adjustments into our predictions.
By integrating these elements, our model will be better suited to predict the impact of missing players and provide a more accurate forecast for a team’s performance.
In question 2, you saw and discussed how to deal with one weakness of the model. For this question, please write about 1-3 other potential weaknesses of the simple average model you made in question 1 and discuss how you would deal with each of them. You may either explain a weakness and discuss how you’d fix that weakness, then move onto the next issue, or you can start by explaining multiple weaknesses with the original approach and discuss one overall modeling methodology you’d use that gets around most or all of them. Again, you do not need to write any code or implement any changes, and you can assume you have access to any reasonable data that isn’t provided in this project. Try to be clear and concise with your answer.
ANSWER 3:
Solution: Implement a weighted average that gives more importance to recent games. For instance, games closer to the 81st game might be given a higher weight, suggesting their performance is more indicative of the upcoming game.
Solution: Apply statistical methods to detect and possibly remove or adjust for outliers. Alternatively, using a median or a trimmed mean might provide a more representative central value for the team’s performance, as it reduces the influence of extreme values.
Solution: Incorporate an opponent strength metric. If a team consistently performed well in rebounds against strong opponents, it might be a sign that they have a strong rebounding strategy or talent. We can adjust the rebound percentage based on the strength of the opponent they’re about to face in game 81.
Overall Modeling Methodology: To address the majority of these weaknesses, a regression model could be employed. By incorporating time as a variable, opponent strength, and other potential predictors, we can derive a more dynamic and responsive prediction for game 81. This approach provides a framework that can be iteratively refined by including more relevant variables or adjusting for nuances in the data.